Environment 2¶

Policy Gradient algorithms¶

Introduction¶

In this notebook, the goal of this environment is to implement 4 policy gradient algorithms: the regular policy gradient (REINFORCE), the policy gradient with a baseline subtracted (REINFORCE with baseline), the one-step actor-critic method, and the advantage actor-critic method. Our agent will play a game in continuous state space, for which we will use a basis function representation to limit ourselves to a finite amount of parameters.

Let's start with a few imports:

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
In [2]:
from environments.environment2 import *

import matplotlib
import matplotlib.cm
import matplotlib.colors
import numpy as np
import random
from tqdm import tqdm
from scipy.special import *

# reasonable plot size
plt.rcParams['figure.dpi'] = 100

The environment¶

The environment is a 2D grid, with a goal in the center, and a "U"-shaped barrier. Our goal is to learn how to reach the objective in the center regardless of where we start. In particular, we will study how the agent learns to go around the barrier when starting at the bottom of the grid.

  • The grid is 10x10, going from $(-5, -5)$ to $(5, 5)$.
  • The objective is a circle of radius $0.5$ centered at $(0, 0)$, that must be reached for the agent to win.
  • The barrier is a "U"-shaped wall of width $1$ centered at $(0, 0)$.
  • The agent is not allowed to exit the grid, or enter the wall.

Here is what the grid looks like: (the objective is in green, the walls are in red).

In [3]:
env = UMaze()
env.render()

Action and reward modeling¶

We assume that the state space is continuous, i.e. that the agent can be at any allowed position in the grid. At each position, we can move a fixed distance $0.2$ horizontally, vertically, or diagonally. Hence, there are $8$ possible actions:

  • Action $0$: move by $0.2 \cdot (0, 1)$.
  • Action $1$: move by $0.2 \cdot (\sqrt{2}/2, \sqrt{2}/2)$.
  • Action $2$: move by $0.2 \cdot (1, 0)$.
  • Action $3$: move by $0.2 \cdot (\sqrt{2}/2, -\sqrt{2}/2)$.
  • Action $4$: move by $0.2 \cdot (0, -1)$.
  • Action $5$: move by $0.2 \cdot (-\sqrt{2}/2, -\sqrt{2}/2)$.
  • Action $6$: move by $0.2 \cdot (-1, 0)$.
  • Action $7$: move by $0.2 \cdot (-\sqrt{2}/2, \sqrt{2}/2)$.

We grant a reward of $1$ when the agent reaches the objective. Actions that would cause the agent to hit a wall are ignored.

Note: the agent is not penalized when it makes a wrong move, otherwise it might become scared of walls and get stuck below the U.

Policy and value functions¶

To model our policy function for the continous state space, we first need to give each state and each action a value. For that, we will use a grid of $N$ gaussian basis functions with a fixed standard deviation $\sigma$. We will study how these two parameters influence the speed and quality of learning.

Let's define the action value function $q(\vec s, a)$ as follows: $$q(\vec s, a) = \sum_{i=1}^N w_{i,a} \exp \left(- \frac{||\vec s - \vec x_i||^2}{2\sigma^2}\right)$$ where $\vec x_i$ denotes the position of the $i$-th basis function. For our policy function $\pi$, we will use a softmax policy: $$\pi(\vec s, a) = \frac{\exp {q(\vec s, a)}}{\sum_{b=0}^7 \exp {q(\vec s, b)}}.$$ Our goal is to learn the $N \times 8$ weights $w_{i, a}$.

For some algorithms, we will also need to learn a state value function $v(\vec s)$. Once again, we will use basis functions: $$v(\vec s) = \sum_{i=1}^N \theta_{i} \exp \left(- \frac{||\vec s - \vec x_i||^2}{2\sigma^2}\right).$$ For the value function, there are $N$ parameters to learn: the weights $\theta_i$.

Policy gradient (REINFORCE)¶

The standard policy gradient only uses the policy function and the gradient of the log-policy $\nabla_{w} \ln \pi(\vec s, a)$. Here is the pseudocode:

Reinforce

Policy gradient with baseline subtracted (REINFORCE with baseline)¶

We can subtract a baseline in the policy gradient algorithm. As seen in the lecture, that should improve the speed of learning. We will use the value function as the baseline, and we will also need its gradient to update it.

ReinforceWithBaseline

One-step actor-critic¶

In the previous two algorithms, we can only perform the update at the end of the episode. The disadvantages are that we need to wait longer before we can learn, and we also use more memory to store all the past states, actions and rewards. With the one-step actor-critic algorithm, the updates are instead performed in real time.

OneStepActorCritic

Actor-critic with eligibility traces¶

We also look at the eligibility trace version of the actor-critic algorithm, also known as Advantage Actor-Critic.

EligibilityTraceActorCritic

Implementation¶

Your goal is to implement the algorithms explained above, and observe the following:

  • How the number of basis functions influences the speed of learning.
  • How the standard deviation of the basis functions influences the speed of learning.
  • How the choice of algorithm influences the speed of learning.

Here are a few important notes:

  • In each episode, we pick the starting position at random.
  • The discount factor $\gamma$ needs to be very close to $1$ due to the high number of steps per episode. $0.99$ is a good choice.
  • To avoid wasting too much time in the early episodes, we enforce a hard limit on the number of steps per episode. $300$ is a good limit.

Core model functions¶

Here are a few core functions that we will use in all 3 algorithms. The ModelParams class is used to store the discount factor $\gamma$, the count and shape of the basis functions, and the weights $w_{i,a}$ and $\theta_i$. Please complete the code as necessary.

In [4]:
class ModelParams:
    def __init__(self, npoints=13, gamma=0.99, stdev=0.5):
        self.action_weights = np.zeros((npoints * npoints, 8))
        self.value_weights = np.zeros(npoints * npoints)
        # need to be VERY careful here given the large number of iterations
        # if gamma is too small we'll have a hard time updating the policy correctly!
        self.gamma = gamma
        self.stdev = stdev

        self.npoints = npoints
        positions = np.linspace(-6, 6, npoints)  
        x_pos, y_pos = np.meshgrid(positions, positions)
        self.x_pos = x_pos.flatten()
        self.y_pos = y_pos.flatten()

        self.w_cache = None
        self.w_cache_pos = None

    def get_interpolation_weights(self, position: State) -> np.ndarray:
        """
        This returns an array of size N, weight for each i: exp(-||s-x_i||^2/(2\sigma^2))
        position: a list of two values: first for x pos; second for y pos
        """
        # We do a bit of caching to not compute this every time
        if self.w_cache_pos == position:
            return self.w_cache

        w = np.exp(- ((self.x_pos - position[0]) ** 2 + (self.y_pos - position[1]) ** 2) / (2 * self.stdev ** 2))
        self.w_cache = w
        self.w_cache_pos = position
        return w

    def compute_value(self, position: State) -> float:
        """
        This computes the v(s) function.
        """
        w = self.get_interpolation_weights(position)
        v = np.sum(w*self.value_weights)
        return v
    
    def get_action_values(self, position: State) -> np.ndarray:
        """
        This computes [q(s,a) for all a] for a specific s.
        """
        w = self.get_interpolation_weights(position)
        q =  np.sum(np.expand_dims(w,1)*self.action_weights, axis=0)
        return q
    
    def get_action_probabilities(self, position: State) -> np.ndarray:
        """
        This computes the [\pi(s, a) for all a].
        """
        q = self.get_action_values(position)
        pi = softmax(q)
        return pi
    
    def get_action_probability(self, position: State, action: Action) -> float:
        """
        This computes \pi(s, a).
        action: an int of [0,7]
        """
        return self.get_action_probabilities(position)[action]

    def pick_action(self, position: State) -> Action:
        """
        This randomly chooses an action following the policy.
        """
        probs = self.get_action_probabilities(position)
        # np.random.choice is a bit slow, so we code it by hand
        r = random.random()
        for i, p in enumerate(probs):
            r -= p
            if r < 0:
                return i

    def compute_action_log_grad(self, position: State, action: Action) -> np.ndarray:
        """
        This computes \nabla_w \ln(\pi(s,a))
        """
        pi = self.get_action_probabilities(position)
        w = self.get_interpolation_weights(position)
        grad = np.zeros_like(self.action_weights)
        grad = - np.outer(w, pi)
        grad[:,action] += w

        return grad
        

    def compute_value_grad(self, position: State) -> np.ndarray:
        """
        This computes \nabla_\theta v(s).
        """
        w = self.get_interpolation_weights(position)
        return w
        

def find_starting_pos(env: UMaze) -> State:
    """
    This picks a random starting position inside the bounds of the environment.
    """
    pos_candidate = None
    while pos_candidate is None:
        pos_candidate = (random.uniform(-5, 5), random.uniform(-5, 5))
        # the two below are easier, and can be used for debugging
#         pos_candidate = (0, 4)
#         pos_candidate = (random.uniform(-1.5, 1.5), random.uniform(-1.5, 1.5))
        if env.is_outside_of_area(*pos_candidate) or env.is_in_objective(*pos_candidate):
            pos_candidate = None
    return pos_candidate

Plots and helpers¶

Here are a few functions to provide nice visualization for the learned value and policy functions.

In [5]:
def render_values(params: ModelParams):
    """
    Visualises the v(s) function for all possible states.
    """
    xs = np.linspace(-6, 6, 101)
    ys = np.linspace(-6, 6, 101)
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    values = np.zeros_like(X)
    for i in range(101):
        for j in range(101):
            values[i, j] = params.compute_value((xs[i], ys[j]))
    #plt.imshow(values, extent=[-6, 6, -6, 6])
    cs = plt.contourf(X, Y, values, levels=10, cmap="jet")
    plt.xlabel("x")
    plt.ylabel("y")
    cbar = plt.colorbar(cs)
    cbar.set_label("Value")
    plt.title("Computed values")

def render_actions(params: ModelParams):
    """
    This visualises the most probable action in each state using a colored arrow. 
    Color of the arrow indicates probability.
    Only shows a few discrete points.
    """
    colormap = matplotlib.cm.get_cmap("jet")

    xs = np.linspace(-5, 5, 21)
    ys = np.linspace(-5, 5, 21)
    for x in xs:
        for y in ys:
            action_probabilities = params.get_action_probabilities((x, y))
            action = np.argmax(action_probabilities)
            max_probability = action_probabilities[action]
            plt.plot(x, y, "o", color="black", markersize=0.1)
            plt.arrow(x, y, 0.3 * MOVES[action][0], 0.3 * MOVES[action][1], head_width=0.1, head_length=0.1, color=colormap(max_probability), linewidth=1)

    plt.xlabel("x")
    plt.ylabel("y")
    cbar = plt.colorbar(matplotlib.cm.ScalarMappable(norm=matplotlib.colors.Normalize(vmin=0, vmax=1), cmap=colormap))
    cbar.set_label("Probability")
    plt.title("Preferred actions")

Here is a function that you can use to run a learning algorithm, and display a nice visualization of the results.

In [6]:
from IPython.display import Markdown

def run_algorithm(env: UMaze, params: ModelParams, algorithm, plot_values, num_episodes: int, bucket_size: int = 500, **kwargs):
    """
    This function runs an algorithm for a specified number of episodes.
    Its arguments are:
    * env(UMaze): Instance of the environment
    * params(ModelParam): Parameters given to the model.
    * algorithm(function that accepts env and params as arguments and returns a Tuple[float, int]): Your algorithm.
    * plot_values(bool): If True, it will plot the values using one of the helper functions above.
    * num_episodes(int): Number of episodes to play.
    * bucket_size(int): Specifies how big groups should be when computing averages, i.e. a size of 500 averages a reward for diplay over 500 episodes.
    * Extra arguments specified like "action_rate=0.1" will be passed to the algorithm
    """
    assert num_episodes % bucket_size == 0, "num_episodes must be a multiple of bucket_size"

    num_buckets = num_episodes // bucket_size
    rewards = np.zeros(num_buckets)
    lengths = np.zeros(num_buckets)
    for i in tqdm(range(num_episodes)):
        random.seed(i) # set the seed to the episode id to ensure reproducibility
        episode_reward, episode_length = algorithm(env, params, **kwargs)
        rewards[i // bucket_size] += episode_reward / bucket_size
        lengths[i // bucket_size] += episode_length / bucket_size
    plot_points = (np.arange(num_buckets) + 1) * bucket_size

    display(Markdown("#### %s (sigma=%.2f, npoints=%d^2)" % (algorithm.__name__, params.stdev, params.npoints)))
    plt.subplot(221)
    plt.plot(plot_points, rewards)
    plt.title("Fraction of episodes where the goal is reached before %d iterations" % 300) # Max iterations is hardcoded to 300 by default
    plt.xlabel("Episode")
    plt.ylabel("Fraction")

    plt.subplot(222)
    plt.plot(plot_points, lengths)
    plt.title("Episode length over time")
    plt.xlabel("Episode")
    plt.ylabel("Episode length")

    plt.subplot(223)
    render_actions(params)

    if plot_values:
        plt.subplot(224)
        render_values(params)

    plt.subplots_adjust(wspace=0.2, hspace=0.3)
    #plt.tight_layout()
    plt.gcf().set_size_inches(12, 8)
    plt.show()

REINFORCE¶

First, implement the reinforce algorithm:

In [7]:
def reinforce(env: UMaze, params: ModelParams, action_rate = 5e-2, max_iter = 300) -> Tuple[float, int]:
    """
    Run an episode and mutate the params in-place.
    :return: tuple containing the (total) reward for the episode, and the number of steps
    """
    env.reset()
    # Pick random starting position
    env._state = find_starting_pos(env)
    states = [env._state]
    actions = []
    rewards = []
    iteration = 0
    while iteration < max_iter and not env.end:
        # Compute action
        action = params.pick_action(states[-1])
        actions.append(action)
        # Do action
        state, reward = env.do_action(action)
        states.append(state)
        rewards.append(reward)
        iteration += 1
    
    rewards = np.array(rewards)
    for t in range(len(actions)):
        G = np.sum(np.power(params.gamma, np.arange(t, len(actions))-t)*rewards[t:])
        params.action_weights += action_rate * params.gamma**t * G * params.compute_action_log_grad(states[t], actions[t])
    
    return sum(rewards), iteration

Now, you can perform a few experiments!

In [8]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:24<00:00, 26.03it/s]

reinforce (sigma=0.25, npoints=9^2)¶

In [9]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:31<00:00, 36.87it/s]

reinforce (sigma=0.50, npoints=9^2)¶

In [10]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:10<00:00, 76.75it/s]

reinforce (sigma=0.75, npoints=9^2)¶

In [11]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:34<00:00, 106.38it/s]

reinforce (sigma=1.00, npoints=9^2)¶

In [12]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:03<00:00, 81.14it/s]

reinforce (sigma=1.50, npoints=9^2)¶

In [13]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:26<00:00, 25.90it/s]

reinforce (sigma=0.25, npoints=13^2)¶

In [14]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:56<00:00, 56.74it/s]

reinforce (sigma=0.50, npoints=13^2)¶

In [15]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:37<00:00, 102.77it/s]

reinforce (sigma=0.75, npoints=13^2)¶

In [16]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:30<00:00, 110.86it/s]

reinforce (sigma=1.00, npoints=13^2)¶

In [17]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:50<00:00, 43.40it/s]

reinforce (sigma=1.50, npoints=13^2)¶

In [18]:
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [07:25<00:00, 22.46it/s]

reinforce (sigma=0.15, npoints=19^2)¶

In [19]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:17<00:00, 26.51it/s]

reinforce (sigma=0.25, npoints=19^2)¶

In [20]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:07<00:00, 78.54it/s]

reinforce (sigma=0.50, npoints=19^2)¶

In [21]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:39<00:00, 100.46it/s]

reinforce (sigma=0.75, npoints=19^2)¶

In [22]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [01:44<00:00, 95.34it/s]

reinforce (sigma=1.00, npoints=19^2)¶

In [23]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:04<00:00, 32.80it/s]

reinforce (sigma=1.50, npoints=19^2)¶

REINFORCE with baseline¶

Now, copy/paste and modify your reinforce function above to subtract a basline.

In [26]:
def reinforce_with_baseline(env: UMaze, params: ModelParams, action_rate = 5e-2, value_rate = 5e-2, max_iter = 300) -> Tuple[float, int]:
    """
    Run an episode and mutate the params in-place.
    :return: tuple containing the (total) reward for the episode, and the number of steps
    """
    env.reset()
    # Pick random starting position
    env._state = find_starting_pos(env)
    states = [env._state]
    actions = []
    rewards = []
    iteration = 0
    while iteration < max_iter and not env.end:
        # Compute action
        action = params.pick_action(states[-1])
        actions.append(action)
        # Do action
        state, reward = env.do_action(action)
        states.append(state)
        rewards.append(reward)
        iteration += 1
    
    rewards = np.array(rewards)
    for t in range(len(actions)):
        G = np.sum(np.power(params.gamma, np.arange(t, len(actions))-t)*rewards[t:])
        delta = G - params.compute_value(states[t])
        params.value_weights += value_rate * delta * params.compute_value_grad(states[t])
        params.action_weights += action_rate * params.gamma**t * delta * params.compute_action_log_grad(states[t], actions[t])
    
    return sum(rewards), iteration
    

Now, you can perform a few experiments!

In [28]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:44<00:00, 28.99it/s]

reinforce_with_baseline (sigma=0.25, npoints=9^2)¶

In [29]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:20<00:00, 38.32it/s]

reinforce_with_baseline (sigma=0.50, npoints=9^2)¶

In [30]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:52<00:00, 58.12it/s]

reinforce_with_baseline (sigma=0.75, npoints=9^2)¶

In [31]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:23<00:00, 69.76it/s]

reinforce_with_baseline (sigma=1.00, npoints=9^2)¶

In [32]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:39<00:00, 62.72it/s]

reinforce_with_baseline (sigma=1.50, npoints=9^2)¶

In [33]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:55<00:00, 28.12it/s]

reinforce_with_baseline (sigma=0.25, npoints=13^2)¶

In [34]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:29<00:00, 47.78it/s]

reinforce_with_baseline (sigma=0.50, npoints=13^2)¶

In [35]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:30<00:00, 66.42it/s]

reinforce_with_baseline (sigma=0.75, npoints=13^2)¶

In [36]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:04<00:00, 80.56it/s]

reinforce_with_baseline (sigma=1.00, npoints=13^2)¶

In [37]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:12<00:00, 51.98it/s]

reinforce_with_baseline (sigma=1.50, npoints=13^2)¶

In [38]:
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:39<00:00, 25.04it/s]

reinforce_with_baseline (sigma=0.15, npoints=19^2)¶

In [39]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:52<00:00, 28.36it/s]

reinforce_with_baseline (sigma=0.25, npoints=19^2)¶

In [40]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:12<00:00, 52.03it/s]

reinforce_with_baseline (sigma=0.50, npoints=19^2)¶

In [41]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:35<00:00, 64.22it/s]

reinforce_with_baseline (sigma=0.75, npoints=19^2)¶

In [42]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:13<00:00, 75.14it/s]

reinforce_with_baseline (sigma=1.00, npoints=19^2)¶

In [43]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:10<00:00, 32.23it/s]

reinforce_with_baseline (sigma=1.50, npoints=19^2)¶